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Abstract 


We highlight a pitfall when applying stochastic variational inference to general 
Bayesian networks. For global random variables approximated by an exponential 
family distribution, natural gradient steps, commonly starting from a unit length 
step size, are averaged to convergence. This useful insight into the scaling of ini¬ 
tial step sizes is lost when the approximation factorizes across a general Bayesian 
network, and care must be taken to ensure practical convergence. We experimen¬ 
tally investigate how much of the baby (well-scaled steps) is thrown out with the 
bath water (exact gradients). 

1 Introduction 

Stochastic variational inference is framed as maximizing a globaH variational parameter A, which 
is the natural parameter of a conjugate exponential distribution j2[. In this framework, stochastic 
gradient steps are taken along the natural gradient ID to optimize for A. A pleasing property of 
stochastic variational inference on a conjugate exponential distribution and approximation q( A) is 
that the gradient is automatically rescaled so that a unit-length step size will minimize it. For a 
general Bayesian network, where the global variational parameters are subdivided to parameterize 
different factors q, in the network’s variational approximation, the picture is less clear. Hoffman et 
al.’s appendix suggests a stochastic updating scheme like that of the global version El. We show 
here that the problem is more subtle in the general case, as component-wise noisy natural gradients 
can tightly couple variational parameters, and following the default recipe can sometimes lead to a 
scheme that “diverges” beyond recovery! 

These remarks are of particular value to the Xbox recommender system, which uses stochastic vari¬ 
ational inference in a Bayesian network on “worldwide” scale IB El- Some of the results presented 
in Sec.[3]are preliminary investigations that were done when designing the system in 2012. 

2 Variational Bayes 

A Bayesian network between the variables X = {x ; } defines the conditional dependency structure 
between them through their joint probability p(X) = p^jpa^). Following Fig. |TJ let pa(j') be 
the set of indexes of parents of random variable(s) x,; for notational convenience we let pa ? = {x/, : : 
k £ pa(j')} denote the parent variables. The variables in the network can be hidden or observed, 
X = {X ,X°}. Variational Bayes (VB) approximates the posterior p(X h |X°) with g(X h ), by 
maximizing the evidence lower bound 



The evidence lower bound is locally optimized with respect to local variational parameters. 
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Figure 1: A Bayesian network, indicating 
x,’s Markov blanket. The parents of x; are 
pa ; and its children x k £ chi. For a com¬ 
pact notation we also write k £ ch(i) as 
the index set of children, where it is clear 
from context. Each child k has parents 
x, and cpi (the co-parents with x,). The 
form of our notation loosely matches Winn 
and Bishop’s in j(3, as Alg.jTJcan be inter¬ 
preted as “stochastic variational message 
passing”. 


If i indexes the hidden variables, we factorize the approximation with 

^(x 11 )=■ 

i 


Let Kj^ti indicate the expectation taken over YIjM ( x j)- The bound can be maximized in a 
component-wise fashion by iteratively setting each q, (x,) to the maximum 


log Qi ( x ») = logp(x. i |pa l ) 


E 

fcGch(i) 


E 


'jfr 


logp(x fe |pa fc ) 


const 


( 1 ) 


In many practical networks there are some x,; for which number of children N t = |ch(i)| is large. 
In ID, x, is a topic-vocabulary distribution from which millions of documents are generated. In 
Sec. [3] and |4][5) the interaction is bilinear, where user x, and item x :] variables are combined to 
represent a user’s affinity to an item. Rather than summing over all ch(z) for each update in 0, 
we aim to stochastically approximate the expectations. It alleviates two problems: firstly, the sum 
contains many terms; secondly, the update depends on some q(x.j) which will be re-estimated, and 
the expense of fully estimating q{x.,) is lost as it too will be re-estimated. 


2.1 Conditionally conjugate models 

The updates in 0 are straightforward when the Bayesian network is conditionally conjugate; that 
is, when the distribution of x,, conditioned on pa,, is (a) drawn from an exponential family, and (b) 
is conjugate with respect to the distribution of pa,. We define the exponential family as 

logp(xj|paj) = r 7 i(pa i ) T 0 i(xi) + /j(x;) + gi { paj ( 2 ) 

where r/^pa,) is the natural parameter vector, <f>i(xi) forms the sufficient statistics, and f/,(pa,) 
defines the normalizing constant through t/j( paj = —log f exp{rji(pa i ) T (/>i(xi ) + /,;(x,)} dx,. 
We can view 0 as a “prior” over x,. Now consider a node x k £ ch, in Pig. |T| We subdivide pa fc , 
the parents of x kl into x, and its co-parents cp.,: 

l°gp(xfe|xj,cpj = J 7 fc(x i ,cp i ) T ^ fe (x fc ) + /(x fc ) + 5 (x s ,c Pi ) . 

We can view this as a contribution to the “likelihood” of x*. We include the co-parents as they 
are part of x/s Markov blanket. Through conjugacy, p(x ): |pa,, ; ) and p(x^|xj,cpj) have the same 
functional form with respect to Xj, so that we can rewrite p(x/ i: |x,. cp,,) in terms of the sufficient 
statistics <f>i(xi) by defining some function r) kl with 

logp(x fc |x i ,cp i ) = rj ki {x kl cpi) 7, (j)i{xi) + h(xk,cpi) . 

We furthermore parameterize the q(xj) distributions in terms of their natural parameters. To distin¬ 
guish them, we denote their natural parameters by A,, and define A = [Ai,..., A/]: 

log 5 i(xj|Ai) = A f(/>i(xi) + fi(x.i) +gi(Xi) . (3) 


2.2 Variational Bayes updates and their stochastic version 

Returning to <|TJ, we can write 


log<?*(xi) = 


Vi( Pai)+ r 1ki{^k,cp i ) 

fcGch(i) 


Mx-i) + fi(Xi) + const, 


(4) 
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from which we can directly read off the updated natural parameter A* through (|3j. Notice now 
that r/i is a multi-linear function of the random variables pa,, i.e. it is linear in each parent ran¬ 
dom variable. In the same way rj ki is a multi-linear function of the random variables x/ ;: and cp, : . 
Furthermore, q factorizes over these variables (except where they are observed, of course). We can 
therefore reparameterize 0 in terms of expectations over qj, i ^ j with 


E 


Vi( P a i) = n, ({% 0 j(xj)]} 


E 




'i/i 


r) ki (x kl cpj = rjki ( Efc 


= Vi 


jGpa(i') 

M X k) , {Ej } 


jeep (i) 


— V ki 


Algorithm 1: Stochastic Variational Bayes 


or convergence do 


for t = 1 to t max 
Pt = (t + r)- K 

for each hidden x, do 

C f- C random nodes from ch, 

r'eitfEai 

option (a): A* «- (1 - p t ) A, + p t Af mp 

end for 

option (b): A «— (1 — p t ) A + p ( A temp 

end for 


A; from its old value to A 
the gradient Va, £, and Sec. 


This is a key ingredient of algorithms like vari¬ 
ational message passing (When Xfc is ob¬ 
served, cpki^-k ) is kept as is, as it is averaged 
over a delta function.) The variational update 
in 0 becomes 

A* = f}i + Vki ■ (5) 

fcGch(i) 

The update is a step along the natural gradient 
m. equivalent to setting the gradient to zero 
by solving for the zero of the derivative of £ 
with respect to A,. In particular, (|5| u pdates 
using a step of unit length along the natural gradient. Sec. A.l derives 


A.2 


states its natural form \7\ £. 


When Ni = |ch(i)| is large, not all the child nodes might be accessed in reasonable time. Further¬ 
more, when g(xj) is re-estimated, the (previous) large computation is discarded and recomputed. 
We may alternatively consider a subsample of nodes from chip) to determine the sufficient statis¬ 
tics. By placing a uniform distribution p) on the atoms rj k ,, the update from (JsJ is equivalent to 
A, = fji + Ni Ep [fj], This expectation can be estimated in many ways. Let set C be a sample of C 
children from ch, without replacement and let 

a r p =«*+§£«!*■ (6) 

fcGC 

Taking expectations gives A* = E[A* omp ] = fj, + jV,Ep, [fj]. With p t —> 0, = 00 anc l 

Y^tL i Pt < oo, these stochastic natural gradients are used in Alg. [I] which is a stochastic version 
of variational message passing. In Alg. [lj scalar k £ (1,1 is a forgetting rate, while delay r > 0 
discounts early iterations more. 

There are two options in Alg. [I] For option (a), the mean value of the parameters of q(X h ) is 
periodic in /, the number of factors in q, and convergence to a local optimum can also be guaranteed 
for /-dependent mean values 0. Option (b) is the update scheme given in |2). 


3 Bayesian matrix factorization 

To illustrate a general Bayesian network, we factorize a sparse matrix of a subsample of a million 
entries in the Netflix data set (M = 4805 users and N = 16015 items). Each entry r mn is user m’s 
rating of movie n on a five-star rating scale. For illustrative purposes, consider a Gaussian bilinear 
ratings model 

V„) = J\f (Tmn | U m V n , 1) 

for user parameter vector u m € R A and item trait vector v ra € R K . We place a factorized prior 
A f(u m k ! 0, 1) on each of the entries of u m and v n . We choose a fully factorized Gaussian ap¬ 
proximation q(U) = q(u m k ), with a similar approximation for q(V). The VB update for 

q{umk ) therefore incorporates 2 1\ — 1 co-parents due to the inner product. With the Gaussian’s 
natural parameters being its precision and mean-times-precision, it is 

\ temp _ f pteC \ _ f l\ Nm \ ' f Vatg['U n fc] + Eg[r? n / C ] \ _ 

mk - l mean ■ prec J ~ l 0 J C ^ lE,Kt](r mn - - j E,[« m fe']E,[r nfc /]) J ' ; 
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component-wise updates 


x •] q 7 component-wise gradients, global updates 


x 10 7 






component-wise updates x -jq 6 component-wise gradients, global updates 




Figure 2: Convergence of C[q\ with pt = t~ 0 ' 6 . Alg.[I]s option (a) is shown in the left column; option (b) 
is shown in the right column. The x-axes are on a logarithmic scale. The global stochastic gradient is not in 
its natural form, and the effect of a large variance in the gradient estimate and overshooting with too large step 
sizes of pt € (0,1] is clearly visible for small C. Note that r mn ’s can be revisited over multiple loops in Alg.ln 
Different magnifications of the same two convergence plots for options (a) and (b) are shown in the three rows 
of graphs. 


Fig.[2]shows C\q] as a function of the number of times that individual ratings (observed nodes) r mn 
are accessed or queried (using K = 5). The value of the bound is shown for the use of at most 
C = 1,..., 20 children when estimating the gradient of each random variable with (|6]i and 0. 
Both options (a) and (b) in Alg. |T] “diverge” in a numerically unrecoverable way when C is small. 
This is due to the global gradient not being in its natural form, and using a step size of p t £ (0,1] 
that is too big, overshooting with too large gradient steps. 

Full VB, shown in black in Fig. 2] implicitly uses pt = 1 in 0. As the stochastic natural gradient 
depends on other A y , much smal er initial step sizes are required to not “overshoot”. The variance of 
the gradient is simply too large compared to p t . Fig.[2]illustrates this problem for general Bayesian 
networks', see especially the top left figure. 
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In practice, we can overcome this problem overcome by starting with sufficiently small initial step 
sizes p t <C 1. For C = 1 in option (a) this was starting from pi = and Pi = gj for option (b). 
In 00 the value of C varied depending on a user or item’s usage, and there p t = 1 was fixed for 
the first ten iterations before slowly decreasing it. 

Have we thrown the baby (well-scaled steps) out with the bath water (exact gradients)? Maybe 
some. As shown by this short note, it is still an open question. 
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A Gradients 

In this Appendix, we derive the component-wise gradients and their natural version, and present 
basic intuition for why steps down the stochastic gradient can be taken. 

A.l Component-wise gradients 

The function that’s minimized to find |5]l is £(A, ) below. It is a function of Aj, whilst keeping all 
other Aj for j / / fixed: 


£(Aj)=E g logp(xj|paj) + Y logiKxfclxi.cpJ-loggi(xi|A») 


fcGch(i) 


= E g r/ l (pa l ) T ^ i (x i ) + /»(xi) + ^(paj 



k£ch(i) 


- Af </>i(xj) - /»(xi) - §i{ Aj) 


Because of local conjugacy, p(xfc |xj, cpj can be rewritten in terms of the sufficient statistics </>, (x,) 
through a multi-linear function rp., of the random variables x fc and cp ( to yield 


£(Ai) = Eg T7 4 (pa i ) r </>i(x i ) + /i(xi) + 5i(pa i ) + 



fcGch(i) 


Af^i(xi) - /j(xj) - 5i(Aj) 
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Taking expectations over q gives, as function of A,, 


£( Ai) 



T 

Ej [0i(xj)] - AfE i [^i(xj)] 


ff*(Aj) + const , 


( 8 ) 


with gi(Xi) = — log f exp{Af </>j(xj) + /,(x,)} dxj. The derivatives of the log partition function 
—ffi(Aj) with respect to A., give the expected sufficient statistics 


-Vgi(Xi) = Ej [</>i(x j; )] 
-V 2 gi(Xi) = VEj [^j(xi)] 


= Ej 


(<M x i) - Ej [0j(xj)] J ^j(xj) 


Ej [0j(xj)]) 


T" 


= COVj [0i(Xj)] , 


and by using properties of the exponential family, the gradient of C with respect to Aj is therefore 


V Ai £(Aj) 


cov 


[^(xj)] n, 


+ El Vki ~ Aj 

feGch(i) / 


(9) 


Solving for V Ai £(Aj) = 0 yields the component-wise VB update A* = rji + XEch(i) Vki that 
we find in (|5ji. Gradient V A £ depends on Aj through cov, [d, (x,)| and Aj, and in the next section 
we will show that the natural gradient V Ai £ removes the dependency on covi[<pi (xj)], so that it is a 
linear function of Aj, with the minimum being attained by taking a step of length one along it. 


A.2 Component-wise natural gradients 

The Fisher information matrix of qi is 


G(Aj) = Ej 


= E,- 


(v Ai logg(xj|Aj)^ (v Ai logg(xj|Aj)^ 

(^i(xi) - Ej[</>j(xj)]^ ^j(xj) - Ej[</>j(xj)]^ 


= COVj[0j(Xj)] , 

and the component-wise natural gradient is obtained by multiplying it with V A £, yielding 

V Ai £(Aj) = G(Aj) _1 V Ai £(Aj) = fj l + E rj ki - A, . 

fcGch(i) 

A gradient descent along the natural gradient is taken with step length p > 0. Starting at point 
A-^ 1 ), gradient descent updates it to A,- 4 '* with 

Aj* } <- Af _1) + pV A (t-i>£(Aj) = Af _1) +p(fu+ E ^ - X t lS> ) 

V k£ch(i) / 

= ( i -p) A i t_ 1 ) +p(^+ E 

V fcGch(i) / 

= (1 — p) Aj ^ + pX* . 

When the above update is compared to we see that the minimum X'f 1 -t— A* is obtained by 
applying a step size of p = 1 along the natural gradient. 


A.3 Stochastic natural gradients: a bird’s eye view 

In this section an intuitive motivation will be provided for doing stochastic gradient descent using 
the natural gradient, as it was defined above in Sec. |A.2| The explanation favours an intuitive 
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understanding above mathematical rigour. Imagine that instead of A*, we have access to a sequence 


of samples {Aj emp ’ r }5. =1 , so that E[A' emp ] = A*. We can write the update A-^ A* recursively 
using the sample average 


i(t) 


i - 

t 


1 Va 


temp,r 


T= 1 


Define p t = 


In the running average, YltLi j = 

„2 



^temp,l 


and ESi 


< oo, and therefore 

Pt = oo and Pt < oo. In the running average with p t = each gradient sample 

is treated equally. However, instead of incorporating fraction ] of Aj empJ into the running average. 


we may erase a bit more from the “past memory” A 
AfmP’ 1 . How much more is permissible? 


(t-i) 


to include a bit more of the recent gradient 


Now define p t = t K . For k = the previous samples will be forgotten at a faster rate, and more 
of A- emp,t will be included through a| 4> «— (1 — f -1 / 2 )A- t_11 + i _1 / 2 A- cmp,t . However, at this 
rate past samples are forgotten too quickly, as both f -1 ^ 2 = oo and Etli(f _1 ^ 2 ) 2 = °°- F° r 
any n' > 1, both infinite sums will be finite, e.g. Y^tL i < °° an d XVi ) 2 < oo, an d 
the running average will cling on to old memories, and has too little capacity to incorporate recent 
gradient samples A* cmP:t . Between forgetting too quickly or not at all, a setting of «: G (|, 1] in 
pt = t~ K is therefore permissible. 


A.4 Converging with fickle neighbours 


The running average in Sec. [A3] can boldly start at p t = t~ K = 1 for 1=1, and from this unit 
length step along the natural gradient, accumulate gradient samples until convergence. However, it 
rests on the premise that neighbours A, for j / i from the Markov blanket of x, remain unchanged. 


If this premise does not hold, much smaller steps p t = (t + t)~ k with delay r > 0 are required. 
This is indeed the case. As Sec. [3] shows, a delay r > 0 that is sufficiently large for the stochastic 
gradient scheme to converge in practice is not known a priori. By explicitly stating the shorthand 
definitions of fj t and fp., in <[8]», it is clear that the other A, appear through multi-linear functions in 

= J [ < ^i( x i)]}jgpa(j)) + Vki(^‘k[<Pk('X-k)]i [^i( x j)] } JgC p( ? :)) J [ < A*( x i)] 

\ fcGch(i) J 

- AfEj [0i(xi)] - ffi(Aj) + const. 

If we now consider the global gradient Va-C(A) = [VaijC(Ai), ..., Vaj£(A/)], it is clear from 
the above form (multi-linear in all variables) that we can’t set the gradient to zero and solve for all 
A explicitly, as was done in (|5j. It is usually not even convex problem. 


The gradient steps are along the global natural gradient. It is defined as 

V a £(A) = G(A)- 1 V a £(A), 
with G(A) being the Fisher information matrix of q. 


G( A) = E, 


(V A log g(X h | A)) (V A logg(X h | A)) T 


G( A) is block-diagonal, as the covariance between 0j(x,;) and <f>j(x.j) is zero for i ^ j , due to the 
factorization of q. Its inverse is therefore also block-diagonal, and the natural gradient has the form 
V a £(A) = [V Al £(A!),..., Vaj£(A/)]. 


B Global batch samples 

An alternative to Alg. [T|is to take a batch data sample V of C g i 0 bai observed variables at the start of 
each iteration, and follow either each V A ,G(Ai) or V a £(A). This is outlined in Alg. [2] 
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shared sample; component-wise updates x -jq 7 shared sample; comp, gradients, global updates 








Figure 3: Convergence of C[q\ with pt = t °' 6 . Alg. [sjs option (a) is shown in the left column; option (b) is 
shown in the right column. The x-axes are on a logarithmic scale. 


Algorithm 2: Stochastic Variational Bayes 
1 : for t 1 to t max or convergence do 

2: p t = (t + T)~ K 

3: V <r- Cgiobai random nodes from X° 

4: for each hidden x; : i € pa (V) do 

5: Vi <r- {xfc £ Pn eh.;} 

6: ^temp fj. _|_ J2 keC rjki 

7: option (a): f— (1 — p t )K + Pt.A- cmp 

8: end for 

9: // updates of pa 4 etc. 

10: option (b): A •<— (1 — p t ) A + p t A temp 

11: end for 


Fig. [3] considers batch sizes of C g i 0 bai = 100 
to 1000, in intervals of 100. The convergence 
in Fig.[3]is much slower than that of Fig. [2] For 
the global update in option (b) in Alg.[2] the al¬ 
gorithm only converged on finite machine pre¬ 
cision when pi < ^2 was chosen (smaller for 
some Cgiobai settings), whereas for option (a) 
at least the algorithm converged from p\ = 1 
for all settings. 

In Alg. [2] pa(27) is the set of hidden variables 
that have a child node in V. Line 9 makes 
provision for updating variables (like hyper¬ 
parameters) that don’t have observed children; 
this was not required for our example. 
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